class: center, middle, inverse, title-slide

.title[
# Phase II: Using Our Toolbox
]
.subtitle[
## Module 6: Spatial Awareness
]
.author[
### Dr. Christopher Kenaley
]
.institute[
### Boston College
]
.date[
### 2025/11/03
]

---
class: top

# In class today

<!-- Add icon library -->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.14.0/css/all.min.css">

.pull-left[

Today we'll ...

- Recap random forests
- Apply regression forests in Bermuda

]

.pull-right[
<img src="https://nowiknow.com/wp-content/uploads/59f257a793119.image_.jpg" width="450">
]

---

# The Mystery of Bermuda’s Painted Houses

.pull-left[

- Bermuda’s parishes are famous for their **brightly colored homes** — pinks, blues, greens, purples, and oranges.

- But here’s the question:

> Do neighborhood **color trends** tell us something deeper
> about **who lives there**?

- Could the **proportion of house colors** help predict population size or density?

- Let’s use a **Random Forest model** to find out.

]

.pull-right[
<img src="https://nowiknow.com/wp-content/uploads/59f257a793119.image_.jpg" width="450">
]

---

# Bermuda Data: A model for Project 6

https://bcorgbio.github.io/class/data/bermuda_shape.zip

### Step 1 — Load packages

``` r
library(sf)      # spatial data
library(dplyr)   # data wrangling
library(ggplot2) # plotting
```

### Step 2 — Read the shapefile

``` r
bermuda_house <- st_read("bermuda_shape/bermuda_dat.shp")
```

```
## Reading layer `bermuda_dat' from data source
##   `/Users/Chris/Dropbox/Documents/bc (1)/3140.f25/lect/bermuda_shape/bermuda_dat.shp'
##   using driver `ESRI Shapefile'
## Simple feature collection with 348 features and 10 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -64.88471 ymin: 32.24808 xmax: -64.64763 ymax: 32.38866
## Geodetic CRS: WGS 84
```

---

# Bermuda Data: A model for Project 6

### Step 2 — Read the shapefile

``` r
bermuda_house <- st_read("bermuda_shape/bermuda_dat.shp")
```

`bermuda_house` is now an sf object
(spatial data frame) that contains:

- parish and subparish boundaries (shape data)
- total population (`tot_pop`)
- color proportions (blue, green, orange, etc.)
- a geometry column for plotting shapes

---

# Bermuda Data: A model for Project 6

### Step 3 — Take a peek

``` r
head(bermuda_house, 2)
```

```
## Simple feature collection with 2 features and 10 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -64.79452 ymin: 32.28955 xmax: -64.78599 ymax: 32.29718
## Geodetic CRS: WGS 84
##               name    subparish tot_pop pred_col       blue     green
## 1 City of Hamilton subparish_01    1243    green 0.04278075 0.3101604
## 2 City of Hamilton subparish_02    1348    green 0.31443299 0.3659794
##       orange       pink     purple      white                       geometry
## 1 0.20320856 0.05347594 0.25668449 0.13368984 POLYGON ((-64.78599 32.2904...
## 2 0.08247423 0.07216495 0.08762887 0.07731959 POLYGON ((-64.78599 32.2904...
```

.small[
Each row = one subparish polygon, ready for analysis and visualization.
]

---

# Bermuda Data: A model for Project 6

### Step 4 — Plot subparishes by color

``` r
# Define color palette
color_map <- c(blue = "#0072B2", green = "#009E73", orange = "#E69F00",
               pink = "#CC79A7", purple = "#8A2BE2", white = "#F0F0F0")

# Plot predominant house colors
p <- bermuda_house %>%
  ggplot() +
  geom_sf(aes(fill = pred_col), color = "black", linewidth = 0.2) +
  scale_fill_manual(values = color_map, name = "Predominant Color") +
  theme_minimal() +
  labs(title = "Predominant House Colors in Bermuda",
       subtitle = "Each polygon represents a subparish")
```

---

# Bermuda Data: A model for Project 6

### Step 4 — Plot subparishes by color

.pull-left[

``` r
plot(p)
```

**What this shows:**

- Each polygon = one subparish
- Fill color = predominant house color
- Boundaries = spatial geometry

]

.pull-right[
<!-- -->
]

---

# From Colors to Predictions

.pull-left[

### What we have

For each **subparish** in Bermuda:

| Variable  | Description                   |
|-----------|-------------------------------|
| `blue`    | Proportion of blue houses     |
| `green`   | Proportion of green houses    |
| `orange`  | Proportion of orange houses   |
| `pink`    | Proportion of pink houses     |
| `purple`  | Proportion of purple houses   |
| `white`   | Proportion of white houses    |
| `tot_pop` | Total population (our target) |

We’ll train a **Random Forest regression** to predict `tot_pop` based on all those color proportions.

]

.pull-right[

### Concept:

.center[
<img src="https://miro.medium.com/v2/resize:fit:1184/format:webp/1*i0o8mjFfCn-uD79-F1Cqkw.png" width="350">
]

**Input:** house color proportions

**Model:** many decision trees → combined forest

**Output:** predicted population size

.small[
Each tree finds patterns — together, the forest averages them to make robust predictions.
]

]

---

# Bermuda's Random Forest

.pull-left[

### Step 1 — Create a clean, numeric dataset

``` r
rf_dat <- bermuda_house %>%
  ungroup() %>%
  st_drop_geometry() %>%
  na.omit() %>%
  select(-name, -pred_col, -subparish)

head(rf_dat)
```

```
##   tot_pop       blue     green     orange       pink     purple      white
## 1    1243 0.04278075 0.3101604 0.20320856 0.05347594 0.25668449 0.13368984
## 2    1348 0.31443299 0.3659794 0.08247423 0.07216495 0.08762887 0.07731959
## 3    1208 0.25294118 0.3529412 0.07058824 0.09411765 0.21764706 0.01176471
## 4    1212 0.08484848 0.2545455 0.06666667 0.20606061 0.14545455 0.24242424
## 5    1271 0.15646259 0.2517007 0.08163265 0.24489796 0.16326531 0.10204082
## 6     208 0.03846154 0.2307692 0.34615385 0.07692308 0.11538462 0.19230769
```

]

.pull-right[

Removes:

- Geometry (spatial polygons)
- Text columns (name, subparish, etc.)
- Missing values

Leaves:

- tot_pop (target variable)
- Color proportions, i.e. blue, green, orange, etc.
(predictors)

]

---

# Bermuda's Random Forest

.pull-left[

### Step 2 — Train a model

``` r
library(randomForest)

set.seed(123)

rf_mod <- randomForest(
  tot_pop ~ .,
  data = rf_dat,
  ntree = 500,       # number of trees
  mtry = 3,          # predictors per split
  importance = TRUE  # compute variable importance
)
```

]

.pull-right[

Model "learns" how house color proportions relate to population size.

- `tot_pop ~ .` → predict population using all remaining columns
- `ntree = 500` → build 500 trees for stability
- `mtry = 3` → randomly test 3 predictors per split
- `importance = TRUE` → tell the forest to rank predictor importance

.small[
Remember: random forests produce averages of predictions and reduce overfitting by combining many weak "trees" into one strong "forest."
]

]

---

# Bermuda's Random Forest

.pull-left[

### Step 3 — Check performance

.small[
``` r
print(rf_mod)
```
]

]

.pull-right[

Key outputs:

- **MSE (Mean Squared Error):** average squared difference between predicted and actual populations
- **% Variance Explained:** how much of the population variation the model captures
- **OOB Error (Out-of-Bag):** built-in cross-validation — smaller = better fit

]

.small[
```
## 
## Call:
##  randomForest(formula = tot_pop ~ ., data = rf_dat, ntree = 500, mtry = 3, importance = TRUE) 
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
## 
## Mean of squared residuals: 23381.26
## % Var explained: 12.56
```
]

---

# Bermuda's Random Forest

.pull-left[

### What is Out-of-Bag (OOB) Error?

When building each tree in the Random Forest:

1. The model takes a **bootstrap sample** (a random subset of the data *with replacement*).
2. About **2/3 of the data** are used to **train** that tree.
3. The remaining **1/3 of the data** — the **Out-of-Bag (OOB)** cases — are **not seen** by that tree.

Each tree predicts its OOB cases → we can test how well it generalizes!
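
The 2/3 vs. 1/3 split is not a tuning choice: it falls out of sampling with replacement, since the chance a given row is never drawn in *n* draws is (1 − 1/*n*)^*n* ≈ *e*⁻¹ ≈ 0.37. A minimal base-R sketch (the names `n`, `boot_idx`, and `oob_frac` are ours, not from the project code):

``` r
# Chance a given row is never drawn in n draws with replacement
# is (1 - 1/n)^n, which approaches exp(-1) ~ 0.368 for large n
set.seed(123)
n <- 10000                               # pretend dataset size
boot_idx <- sample(n, n, replace = TRUE) # one bootstrap sample
oob_frac <- mean(!(1:n %in% boot_idx))   # fraction never drawn = OOB cases
round(oob_frac, 2)                       # about 1/3
```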
]

.pull-right[

### Why it matters

- Acts like **built-in cross-validation**
- Measures model performance **without needing a test set**
- Lower OOB error = better predictive accuracy

.center[
<img src="https://nowiknow.com/wp-content/uploads/59f257a793119.image_.jpg" width="450">
]

]

.small[
Each tree trains on a subset, tests on its “out-of-bag” data → average of all errors = **OOB error rate**.
]

---

# Bermuda's Random Forest

.pull-left[

### Step 3 — Check performance

Which colors matter most?

``` r
varImpPlot(rf_mod)
```

<!-- -->

]

.pull-right[

Interpretation:

Variables plotted higher (larger `%IncMSE`) → stronger predictors of population

Colors that vary a lot across space may explain more of the population pattern

]

---

.pull-left[

### Step 4 — Test predictions

``` r
pred <- predict(rf_mod, rf_dat)

# Compare to actual
compare <- data.frame(Actual = rf_dat$tot_pop,
                      Predicted = pred)

head(compare)
```

```
##   Actual Predicted
## 1   1243  773.0464
## 2   1348  970.7964
## 3   1208  941.4578
## 4   1212  820.1060
## 5   1271  797.6134
## 6    208  271.8791
```

]

.pull-right[

You can now use the model to:

- estimate populations for new data (house color in other parishes)
- compare predicted vs. observed populations
- visualize prediction accuracy

]

---

### Step 4 — Test predictions

.pull-left[

**Visualizing predictions**

``` r
p <- ggplot(compare, aes(x = Actual, y = Predicted)) +
  geom_point(color = "darkgreen", size = 3) +
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  theme_minimal() +
  labs(x = "Actual Population",
       y = "Predicted Population",
       title = "Predicted vs. Actual Population")
```

.center[
Points close to the red 1:1 line = accurate predictions ✅
]

]

.pull-right[

``` r
print(p)
```

<!-- -->

]
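
You can also put numbers on that scatterplot. A minimal sketch of summarizing prediction error, rebuilding `compare` from the six rows printed in Step 4 so the snippet runs standalone (with the full model in your workspace you would use `compare` directly):

``` r
# Rebuilt from the head(compare) output above, for illustration only
compare <- data.frame(
  Actual    = c(1243, 1348, 1208, 1212, 1271, 208),
  Predicted = c(773.0464, 970.7964, 941.4578, 820.1060, 797.6134, 271.8791)
)

# RMSE: the typical prediction miss, in units of people
rmse <- sqrt(mean((compare$Actual - compare$Predicted)^2))

# Squared correlation between observed and predicted values
r2 <- cor(compare$Actual, compare$Predicted)^2

round(c(RMSE = rmse, R2 = r2), 2)
```

Keep in mind these are in-sample numbers (the model predicted its own training data), so they will look rosier than the OOB error from `print(rf_mod)`; trust the OOB statistics or a held-out test set for an honest read on accuracy.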